
Conversation


@gabe-l-hart (Collaborator) commented Aug 18, 2025

Fixes: #15401

cc @matteoserva from the original PR

Changes

  • Auto-detect whether enable_thinking should be true based on whether the rendered template changes when enabled
  • Always set input.enable_thinking from chat_template_kwargs.enable_thinking if set explicitly there
  • Fix the logical condition so that assistant prefill is rejected only when enable_thinking is actually true (whether set by default for thinking models or explicitly via the kwarg)

Testing

All model types were tested using the following set of calls:

# Test without setting enable_thinking
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}]}'

# Test with enable_thinking: true
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}], "chat_template_kwargs": {"enable_thinking": true}}'

# Test with enable_thinking: false
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}], "chat_template_kwargs": {"enable_thinking": false}}'

Thinking model w/ thinking enabled

./bin/llama-server -hf ibm-granite/granite-3.2-8b-instruct-GGUF --jinja
# Reasoning enabled implicitly => error
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}]}'
# {"error":{"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}}

# Reasoning enabled explicitly => error
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}], "chat_template_kwargs": {"enable_thinking": true}}'
# {"error":{"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}}

# Reasoning disabled explicitly => ok
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}], "chat_template_kwargs": {"enable_thinking": false}}'
# {"prompt":"<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday's Date: August 18, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>hello world<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>hi hi"}

Thinking model w/ thinking disabled

./bin/llama-server -hf ibm-granite/granite-3.2-8b-instruct-GGUF --jinja --reasoning-budget 0
# Reasoning disabled implicitly => ok
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}]}'
# {"prompt":"<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday's Date: August 18, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>hello world<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>hi hi"}

# Reasoning enabled explicitly => error
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}], "chat_template_kwargs": {"enable_thinking": true}}'
# {"error":{"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}}

# Reasoning disabled explicitly => ok
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}], "chat_template_kwargs": {"enable_thinking": false}}'
# {"prompt":"<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday's Date: August 18, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>hello world<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>hi hi"}

Non-Thinking model

./bin/llama-server -m ~/models/granite-3.0-2b-instruct/granite-3.0-2B-instruct-F16.gguf --jinja
# Reasoning disabled implicitly => ok
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}]}'
# {"prompt":"<|start_of_role|>user<|end_of_role|>hello world<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>hi hi"}

# Reasoning enabled explicitly => error
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}], "chat_template_kwargs": {"enable_thinking": true}}'
# {"error":{"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}}

# Reasoning disabled explicitly => ok
curl http://localhost:8080/apply-template -d '{"messages": [{"role": "user", "content": "hello world"}, {"role": "assistant", "content": "hi hi"}], "chat_template_kwargs": {"enable_thinking": false}}'
# {"prompt":"<|start_of_role|>user<|end_of_role|>hello world<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>hi hi"}

common/chat.cpp Outdated
return COMMON_REASONING_FORMAT_DEEPSEEK;
} else if (format == "deepseek-legacy") {
return COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY;
} else if (format == "granite") {
@gabe-l-hart (Collaborator, Author):

I just noticed that this was missing, so figured I'd add it (not strictly required for this PR)

Collaborator:

We would recommend using auto instead of adding a new type here (so the existing enum COMMON_REASONING_FORMAT_GRANITE should be removed altogether; it is not actively used anywhere in the codebase).

reasoning_format is supposed to be used for selecting the output format, i.e. the JSON response format from server to client. It should have no effect on the content.

We should document this better to prevent future contributors from thinking that reasoning_format means "template format" when it actually means "server response schema".

@gabe-l-hart (Collaborator, Author):

Thanks! I'll yank this out.

For the life of me, I couldn't find where "auto" was actually being used. I found several places where there was a special case for _NONE or _DEEPSEEK_LEGACY, but I couldn't find _AUTO anywhere other than being set as the default. Can you clarify how this gets used to auto-determine the right reasoning parser?

@ngxson (Collaborator) commented Aug 18, 2025:

The long story is that the message.reasoning_content API response format was initially introduced by DeepSeek's hosted API, hence the name deepseek.

This enum was added so that, in case OpenAI wanted its own API schema, we could select between deepseek and openai. However, the meaning was lost over time due to a combination of bad naming and poor documentation.

auto was added as a no-brainer solution, since deepseek-style reasoning_content is now supported by many applications while OAI migrated to their new Response API.

reasoning_format solely determines the API schema; it is unrelated to the notion of a parser or anything at the chat-template layer. The parser is determined by the chat template itself.
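
To make the distinction concrete, here's a sketch of what reasoning_format actually controls (field values are illustrative, not copied from the docs): with deepseek (or resolved via auto), the server splits the model's thinking out of content into a separate message.reasoning_content field, whereas with none the raw thinking tags stay inline in content. The chat template and its parser are untouched either way.

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 4.",
      "reasoning_content": "The user asked what 2 + 2 is..."
    }
  }]
}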

@gabe-l-hart (Collaborator, Author):

Thanks, that's very helpful context!

// 2. The chat template supports it
bool enable_thinking = params_base.reasoning_budget != 0;
if (enable_thinking && params_base.use_jinja) {
common_chat_templates_inputs dummy_inputs;
@gabe-l-hart (Collaborator, Author):

This logic could move into a free function in chat.* (something like common_chat_supports_enable_thinking).

Also, an alternate implementation would be to be more explicit and switch on the subtype of common_chat_params, but that felt hard to maintain (similar to other places where an arch enum is required). Since this is a boot-time operation, it feels OK to do the extra template expansion to avoid another place where a developer would need to poke in architecture-specific logic.

@gabe-l-hart force-pushed the gabe-l-hart/thinking-model-disabled-agent-prefill branch from 363c664 to 17b4bf8 on August 18, 2025 19:39
@gabe-l-hart changed the title from "Gabe l hart/thinking model disabled agent prefill" to "Thinking model disabled agent prefill" on Aug 18, 2025
@gabe-l-hart force-pushed the gabe-l-hart/thinking-model-disabled-agent-prefill branch from bb48ad1 to 0f07f58 on August 18, 2025 21:04
@arichiardi commented Aug 19, 2025:

Tried the following:

LLAMA_CHAT_TEMPLATE_KWARGS='{\"enable_thinking\":false}' ✔️

Request containing {"chat_template_kwargs": {"enable_thinking":false}} 🔴

EDIT: I believe I am still getting the exception {"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"} when the request contains it.

@gabe-l-hart (Collaborator, Author):

Thanks for testing it @arichiardi. Any chance you can post both the command to launch the server and the full curl command for the request so I can try to repro?

@gabe-l-hart (Collaborator, Author):

@ryan-mangeno if you have any cycles for testing, I'd love some help putting this one through its paces

@arichiardi:
@gabe-l-hart this is my call (copied from gptel)

curl \
  --disable --location --silent --compressed \
  -X POST \
  -y 7200 -Y 1 \
  -D - \
  -d '{"model":"GLM-4.5","messages":[{"role":"user","content":"Can you make images?"}],"stream":true,"temperature":1.0,"chat_template_kwargs":{"enable_thinking":"false"}}' \
  -H 'Content-Type: application/json' \
  http://<your-host>:<your-port>/v1/chat/completions

@gabe-l-hart (Collaborator, Author):

Ok, looking at this request, it doesn't look like it actually exercises the condition this PR targets, since it does not end with an assistant turn. I can't run GLM-4.5 locally, but I ran the same request against Granite 3.2 8b without setting --reasoning-budget 0 and got clean responses.

I modified the above request to end with an assistant turn, and I was able to reproduce what you report: the request is rejected when setting "chat_template_kwargs": {"enable_thinking": "false"}. The key here is that "false" is being sent as a string, not a bool. If I remove the quotes around it to send it as a bool, the request goes through cleanly.

# Incorrectly sent as a string -> error
curl -XPOST -d'{"model":"UNUSED", "messages":[{"role":"user","content":"Can you make images?"}, {"role":"assistant","content":"Nope, I sure can not"}],"stream":false,"temperature":0.0,"max_tokens":30,"chat_template_kwargs":{"enable_thinking":"false"}}' -H "Content-Type: application/json" http://localhost:8081/v1/chat/completions

# Correctly sent as a bool -> ok
curl -XPOST -d'{"model":"UNUSED", "messages":[{"role":"user","content":"Can you make images?"}, {"role":"assistant","content":"Nope, I sure can not"}],"stream":false,"temperature":0.0,"max_tokens":30,"chat_template_kwargs":{"enable_thinking":false}}' -H "Content-Type: application/json" http://localhost:8081/v1/chat/completions

This behavior is a bit confusing, given that the code parses the value as a string and explicitly checks it against the string literals "true" and "false". I'll look a bit more to try to understand why sending it as a string does not match.

@gabe-l-hart (Collaborator, Author):

Ok! I think I've figured out my issue, and it comes down to misusing json_value. This serializes the field and always returns a std::string. The serialized representation of a JSON string, serialized into a std::string, is "\"false\"" (note the quote characters in the string itself), so it fails to match the check. I'll fix this shortly.

@gabe-l-hart (Collaborator, Author):

I was not quite right before. json_value does not always return a std::string. The issue is that the inputs.chat_template_kwargs mapping is constructed with .dump(), which serializes all values to their string representation.
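
A minimal repro of the mismatch (assuming nlohmann::json, which llama.cpp uses):

// Dumping a JSON string keeps its quote characters, so it can never
// compare equal to the bare literal "false".
#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    nlohmann::json kwargs = {{"enable_thinking", "false"}};      // string, not bool
    std::string dumped = kwargs["enable_thinking"].dump();       // yields "\"false\""
    std::cout << std::boolalpha << (dumped == "false") << "\n";  // prints: false
    nlohmann::json kwargs_ok = {{"enable_thinking", false}};     // actual bool
    std::cout << kwargs_ok["enable_thinking"].dump() << "\n";    // prints: false (no quotes)
}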

This opens up a different question: what is the correct way to handle a user sending an invalid type in a chat template kwarg? This seems ill-defined since chat template kwargs are model-specific except for "enable_thinking". It seems like there are three choices:

  1. Explicitly reject "stringified bools" as invalid requests for "enable_thinking"
  2. Intentionally parse "stringified bools" as if they were the corresponding bool (then you have to wonder about things like "False", "No", etc that are sometimes parsed as valid bools in some platforms)
  3. Keep the current implementation that silently treats them as unset (not the best UX!)

My usual inclination is to do "best effort" (2) in this case, but that often leads to loosey-goosey APIs, so the more correct thing to do would be (1).

@ngxson any thoughts on this one?

@arichiardi commented Aug 20, 2025:

@gabe-l-hart

I'd probably go for (2) myself. Additionally, if the param requires a false literal (as opposed to a string) we could just reject the request (fail fast, I like it).

Agree that leaving the value unset without any sort of feedback is sub-optimal.

@gabe-l-hart (Collaborator, Author) commented Aug 20, 2025:

I've implemented the fail-fast path (reject any string in the "enable_thinking" kwarg) locally. I think this is more maintainable going forward, especially if additional kwargs with more complex types become standardized (e.g. documents from the HF apply_chat_template signature). From a backwards-compatibility standpoint, it's also much easier to go from an error to a success later than the other way around.

Right now, any errors raised in oaicompat_chat_params_parse result in a 500. This is not really correct, as this should be a 422 (Unprocessable Entity) (or simply a 400). I'm going to look into whether there's a way to raise errors with error codes from the free functions in utils.hpp.
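
For illustration, a minimal sketch of the fail-fast check (hypothetical helper name and placement, not the exact diff):

// Reject string values for "enable_thinking" instead of guessing truthiness.
#include <nlohmann/json.hpp>
#include <stdexcept>

static bool parse_enable_thinking(const nlohmann::json & kwargs, bool default_value) {
    const auto it = kwargs.find("enable_thinking");
    if (it == kwargs.end()) {
        return default_value;     // not set: keep the server-side default
    }
    if (it->is_boolean()) {
        return it->get<bool>();
    }
    // strings like "false", "False", or "no" are ambiguous: fail fast
    throw std::invalid_argument(
        "\"enable_thinking\" in chat_template_kwargs must be a boolean, got: " + it->dump());
}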

@gabe-l-hart (Collaborator, Author) commented Aug 20, 2025:

It looks like more-nuanced error types in server would require one of the following approaches:

  1. Define an exception type in utils.hpp that includes the HTTP error type to be used from this enum, then handle it correctly in the generic handling logic
  2. Add try / catch around the handlers for the OAI endpoints and explicitly check the error strings

I think (1) would be much better, as it would be extensible to other methods within utils.hpp, but it would also be much more invasive. As such, I'm not going to do this as part of this PR and will wait for further thoughts.
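
As a sketch of option (1), with all names hypothetical:

// An exception type that carries the HTTP status, so the generic handler can
// map validation failures to 400/422 instead of a blanket 500.
#include <stdexcept>
#include <string>

struct http_error : std::runtime_error {
    int status;  // e.g. 400 or 422, mapped to the server's error-type enum
    http_error(int status_, const std::string & msg) : std::runtime_error(msg), status(status_) {}
};

// In the generic handler wrapper, catch the typed error first:
//   try { handler(req, res); }
//   catch (const http_error & e)     { send_error(res, e.status, e.what()); }
//   catch (const std::exception & e) { send_error(res, 500, e.what()); }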

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>
…value

From what I can tell, this started as a Qwen3-specific keyword, but since
the use in `chat.cpp` translates inputs.enable_thinking to the right
thinking kwarg for the given model, it is now more of a standardized
kwarg, so it should always override the default value when sent as part of
the chat_template_kwargs field in the API.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>
With the use_jinja check, non-jinja models would enable thinking and always
fail assistant prefill

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>
There are too many possible "truthy" / "falsy" strings and too many
ambiguous strings that don't have a clear truthy/falsy value, so the
simplest thing to do here is to reject the request. Ideally, this would be
a 422 (Unprocessable Entity), but right now it's coming back as a 500.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart force-pushed the gabe-l-hart/thinking-model-disabled-agent-prefill branch from c0d4709 to 6adae51 on August 21, 2025 13:59
@arichiardi commented Aug 21, 2025:

@gabe-l-hart I built the new commits and I now receive the 500 error with this.

EDIT: never mind, was using the wrong binary 😄

@gabe-l-hart changed the title from "Thinking model disabled agent prefill" to "Thinking model disabled assistant prefill" on Aug 22, 2025
…g-model-disabled-agent-prefill

* origin/master: (76 commits)
scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633)
kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)
gguf-py: byteswapping improvements (ggml-org#12851)
cli : change log to warning to explain reason for stopping (ggml-org#15604)
model-conversion : add mmproj conversion target (ggml-org#15628)
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (ggml-org#15622)
server: higher timeout for tests (ggml-org#15621)
presets : add qwen3-30B-a3b FIM (ggml-org#15616)
HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-org#15615)
kv-cache : better estimate of n_kv for multi-sequence batches (ggml-org#15610)
CANN: refactor mask handling and improve performance in FA (ggml-org#15561)
ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057)
common : add -m to bash completion for --model [no ci] (ggml-org#15591)
OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314)
tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599)
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (ggml-org#15592)
mtmd : fix mtmd ios build (ggml-org#15579)
tests: add performance test for mul mat id (ggml-org#15543)
llamafile: PowerPC Sgemm Optimization (ggml-org#15558)
graph : fix assert in memory-less build_attn (ggml-org#15590)
...
@gabe-l-hart (Collaborator, Author):

@ngxson any progress on this review? I'm also happy for anyone else to look it over if you're over-full.

@gabe-l-hart (Collaborator, Author):

@ggerganov any chance you can look this one over?

const auto rendered_no_thinking = common_chat_templates_apply(chat_templates.get(), dummy_inputs);
dummy_inputs.enable_thinking = true;
const auto rendered_with_thinking = common_chat_templates_apply(chat_templates.get(), dummy_inputs);
enable_thinking = rendered_no_thinking.prompt != rendered_with_thinking.prompt;
Member:

I don't understand the logic of this check - seems we would disable thinking if the 2 rendered prompts are different. Why would that be?

Member:

Nvm, got it. This should be moved to a helper function.

@gabe-l-hart (Collaborator, Author):

I'll double check in a few, but I think the logic as written is that if the two prompts are different (with and without thinking), the enable_thinking bool is set to true. If they match, the assumption is that the enable_thinking flag is ignored in the template. This is certainly a heuristic, since some templates might only honor it under certain conditions (e.g. the funky Granite 3.x XOR conditions with documents and tools).

@gabe-l-hart (Collaborator, Author):

Alright, this has moved to common/chat.* now, in the function common_chat_templates_support_enable_thinking.
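
For reference, a condensed sketch of that helper, simplified from the diff quoted above (the exact signature in common/chat.* may differ):

// Render a dummy conversation twice and report whether flipping
// enable_thinking changes the prompt at all. If it doesn't, the template
// ignores the flag. Note this is a heuristic, per the caveat above.
bool common_chat_templates_support_enable_thinking(const common_chat_templates * chat_templates) {
    common_chat_templates_inputs dummy_inputs;
    common_chat_msg msg;
    msg.role    = "user";
    msg.content = "test";
    dummy_inputs.messages = {msg};

    dummy_inputs.enable_thinking = false;
    const auto rendered_no_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);

    dummy_inputs.enable_thinking = true;
    const auto rendered_with_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);

    return rendered_no_thinking.prompt != rendered_with_thinking.prompt;
}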

…g-model-disabled-agent-prefill

* origin/master: (84 commits)
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
tests : add --list-ops and --show-coverage options (ggml-org#15745)
gguf: gguf_writer refactor (ggml-org#15691)
kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
chat : nemotron thinking & toolcalling support (ggml-org#15676)
scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
llama : add support for EmbeddingGemma 300m (ggml-org#15798)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
...
…o common

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart merged commit 5fac79c into ggml-org:master on Sep 5, 2025
48 checks passed
@gabe-l-hart deleted the gabe-l-hart/thinking-model-disabled-agent-prefill branch on September 5, 2025 21:07
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 5, 2025
…upport

* origin/master:
Thinking model disabled assistant prefill (ggml-org#15404)
Implement --log-colors with always/never/auto (ggml-org#15792)
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
tests : add --list-ops and --show-coverage options (ggml-org#15745)
gguf: gguf_writer refactor (ggml-org#15691)
kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
chat : nemotron thinking & toolcalling support (ggml-org#15676)
scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
llama : add support for EmbeddingGemma 300m (ggml-org#15798)
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
* feat: Set enable_thinking IFF not disabled and supported

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Fix inverted logic condition for prefill error

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Always parse the enable_thinking kwarg to overwrite the default value

From what I can tell, this started as a Qwen3-specific keyword, but since
the use in `chat.cpp` translates inputs.enable_thinking to the right
thinking kwarg for the given model, it is now more of a standardized
kwarg, so it should always override the default value when sent as part of
the chat_template_kwargs field in the API.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Don't limit template expansion check to jinja

With the use_jinja check, non-jinja models would enable thinking and always
fail assistant prefill

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add the error text to json type errors in json_value

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Explicitly reject string values for "enable_thinking"

There are too many possible "truthy" / "falsy" strings and too many
ambiguous strings that don't have a clear truthy/falsy value, so the
simplest thing to do here is to reject the request. Ideally, this would be
a 422 (Unprocessable Entity), but right now it's coming back as a 500.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Move logic for detecting template enable_thinking support to common

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use raw pointer for common chat template function

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025
Development

Successfully merging this pull request may close these issues.

Eval bug: Thinking model with thinking disabled cannot use /apply-template with final assistant turn